Abstract

Efficient utilization of high-performance computers requires good parallelism detection and program
partitioning techniques, followed by efficient scheduling of the partitioned tasks. In this work, we
address issues in parallelism specification and detection, particularly those related to Object-Oriented (OO)
programs. We have proposed solutions to overcome the inheritance anomaly in concurrent OO languages.
We have also proposed a novel type-inference mechanism for static type determination of
objects in OO programs and have developed a precise call-graph construction technique. Moreover,
we have developed efficient task scheduling algorithms that produce an optimal schedule given
a sufficient number of processors. The duplication-based scheduling algorithm scales down gracefully if
the number of available processors is not sufficient.


Foreword 

A wide variety of high-performance computers with radically new architectures has evolved
rapidly over the last few years. This computing power is provided by machines
ranging from networked workstations to superscalar and VLIW machines. Particularly noticeable
among these architectures are distributed memory machines (DMMs), which are used in
a wide variety of applications including fluid flow, weather modeling, and database systems. These
advances in hardware have put demands on the software community to develop efficient concurrent
software systems so that this enormous computing power can be utilized to the maximum
possible extent. This requires the development of suitable programming paradigms, parallelism detection
techniques, and efficient task partitioning and scheduling strategies. The task partitioning
algorithm partitions the application into separate tasks by detecting the parallelism in the program
and represents them in the form of a Directed Acyclic Graph (DAG). After the application
is transformed into a DAG, the tasks are scheduled onto the processors.
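
As a rough illustration of this representation, the sketch below models a small task DAG in Python and evaluates the length of a given schedule. The task names, the costs, and the helper `schedule_length` are hypothetical and chosen only for illustration; the model assumes that a communication cost is paid only when a task and its predecessor run on different processors.

```python
from collections import defaultdict

# Hypothetical task graph: task -> computation cost.
cost = {"a": 2, "b": 3, "c": 3, "d": 1}
# Edge (u, v) -> communication cost, paid only if u and v run on different processors.
comm = {("a", "b"): 4, ("a", "c"): 4, ("b", "d"): 1, ("c", "d"): 1}


def schedule_length(order, proc_of):
    """Finish time of the last task when tasks run in `order` (assumed to be a
    topological order of the DAG) on the processors given by `proc_of`."""
    preds = defaultdict(list)
    for (u, v) in comm:
        preds[v].append(u)
    finish = {}
    proc_free = defaultdict(int)  # time at which each processor becomes idle
    for t in order:
        p = proc_of[t]
        ready = 0
        for u in preds[t]:
            delay = 0 if proc_of[u] == p else comm[(u, t)]
            ready = max(ready, finish[u] + delay)
        start = max(ready, proc_free[p])
        finish[t] = start + cost[t]
        proc_free[p] = finish[t]
    return max(finish.values())


order = ["a", "b", "c", "d"]
one_proc = schedule_length(order, {"a": 0, "b": 0, "c": 0, "d": 0})
two_proc = schedule_length(order, {"a": 0, "b": 0, "c": 1, "d": 0})
print(one_proc, two_proc)
```

With these assumed costs, running everything on one processor takes 9 time units, while moving task `c` to a second processor takes 11, because the communication delays outweigh the gain from parallel execution.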

On the language front, the use of Object-Oriented (OO) languages has become
widespread in every field of computing, ranging from application programming to operating systems.
The OO paradigm provides the tools and facilities for developing software that is easier to build,
extend, reuse, modify, and maintain.


This work deals with detecting parallelism and automatically partitioning applications, especially
with OO programs in view, and then scheduling the partitioned tasks efficiently to reduce
overall execution time.


Statement  of  the  Problem  Studied 


Performing  a  parallel  computation  on  a  DMM  can  be  done  in  two  steps.  The  first  step  deals  with 
specifying  parallelism  and  detecting  parallel  tasks  in  an  application.  The  second  step  consists  of 
scheduling  these  parallel  tasks.  We  discuss  some  issues  involved  with  these  two  steps  in  detail 
below. 

1. Parallelism Specification and Detection: Properties like inheritance and dynamic binding
of objects pose numerous hurdles in correctly specifying and detecting parallelism in OO
programs. One approach to specifying parallelism is to use a concurrent OO language with
constructs for denoting concurrency. But this leads to a major inconsistency, named the inheritance
anomaly, which forces the programmer to redefine classes. A major goal of this work is to solve the
inheritance anomaly. Another hurdle in the path of detecting parallelism is the lack of
program-point-specific type information. It has been proved that static type determination for C++
is NP-hard. With insufficient type information for static analysis, many of the traditional
optimization and parallelizing techniques are rendered useless for OO languages. One prime
objective of this work is to propose a solution to this type determination problem. In order
to expose the parallelism in a program to the maximum extent, an automatic compiler must
perform control and data flow analysis beyond procedural boundaries. The backbone of such
an interprocedural analysis is a precise call-graph. But without exact type information,
a precise call-graph for OO programs cannot be constructed. This work also intends
to find viable alternative methods for precise call-graph construction for OO programs.

2. Scheduling: Task scheduling is one of the key elements in any DMM. One of the major
limitations of DMMs is the high cost of interprocessor communication, which can be reduced
by an efficient scheduling algorithm. A primary goal of this work is to reduce the overall
execution time, which includes communication as well as computation time. It has been proven
that optimal scheduling of tasks onto DMMs is an NP-complete problem, and several
heuristic-based approaches have been explored. This work intends to investigate the trade-off between
the schedule length and the required number of processors. A prime objective is to find
the optimal schedule for DMMs if certain conditions are satisfied and if an adequate number
of processors is available. However, the system might not have the required number of
processors to produce the optimal schedule. Therefore, it is necessary to develop an algorithm
which scales down to produce the optimal schedule for a given number of available processors.
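
To illustrate why missing type information makes the call-graph of item 1 imprecise, the sketch below resolves a virtual call using only the class hierarchy, and then again with a sharper set of receiver types. It is a generic class-hierarchy-based resolution written in Python for brevity, not the algorithm developed in this work; the class names, the tables `base` and `defines`, and the helper `call_targets` are all hypothetical.

```python
# Hypothetical class hierarchy: class -> direct base class (None for a root).
base = {"Shape": None, "Circle": "Shape", "Square": "Shape", "Unit": "Circle"}
# Methods defined (or overridden) directly in each class.
defines = {"Shape": {"area"}, "Circle": {"area"}, "Square": {"area"}, "Unit": set()}


def is_subtype(cls, ancestor):
    """True if cls is ancestor or is derived from it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = base[cls]
    return False


def resolve(cls, method):
    """Class whose definition of `method` runs for a receiver of exact type cls."""
    while cls is not None:
        if method in defines[cls]:
            return cls
        cls = base[cls]
    raise LookupError(method)


def call_targets(static_type, method, inferred_types=None):
    """Possible callees of a virtual call through a pointer of static_type.
    `inferred_types`, if given, is a sharper set of possible receiver types."""
    receivers = {c for c in base if is_subtype(c, static_type)}
    if inferred_types is not None:
        receivers &= set(inferred_types)
    return {resolve(c, method) + "::" + method for c in receivers}


by_hierarchy = call_targets("Shape", "area")
by_inference = call_targets("Shape", "area", inferred_types={"Unit"})
print(sorted(by_hierarchy), sorted(by_inference))
```

With only the declared type `Shape`, the call may reach three different definitions of `area`, so the call-graph needs three edges. If type inference can show that the receiver is always a `Unit`, a single edge to `Circle::area` remains.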


Significant  Accomplishments 


1. We have mentioned in the statement of the problem the necessity of overcoming the inheritance
anomaly for concurrent OO languages. We have proposed a task-parallel language based on
C++, called CORE, which solves the inheritance anomaly.


2. We have discussed earlier that it is extremely important to know the exact type of an object
at a particular program point in order to apply any parallelization technique. The existing type
inference algorithms fail to address this problem adequately for OO programs. We
have proposed an approach named SSAInfer, which transforms programs into static single
assignment (SSA) form before any type inference mechanism is applied. This SSA-based
approach, combined with other constraint-based type inference methods, produces
program-point-specific as well as sharper types. Another property of SSAInfer is that it is
language-independent and can be used in conjunction with other existing methods.

3. We have also proposed another type analysis approach named ITA. ITA performs an incremental
type analysis within the constraint-based framework for restoring correct type information
for type variables after program transformations. ITA blends very well with other compiler
optimizations, such as constant propagation, in improving types.

4. Besides type analysis methods, we have also explored complementary approaches to aggressive
parallelization of OO programs. As mentioned earlier, failure to build a precise call-graph
hinders interprocedural analysis of OO programs. Apart from the lack of type information, this is
also due to the presence of virtual functions and inheritance. We have proposed and implemented
an algorithm for precise call-graph construction using class hierarchy analysis. We have run
this algorithm on a suite of benchmark programs, and the results show considerable improvement in
the precision of the call-graph.

5. A code generation framework and run-time system have been developed for the Sisal compiler; they
maintain the static and dynamic ownerships at every processor to avoid communication
overhead on ownership information. The compiler has been targeted to Intel Touchstone
i860 systems. The speed-ups in some cases are low compared to the required number of
processors, because an inverted-tree-type parallelism is present in most Sisal programs. This
type of parallelism is the cause of the overall high processor demand with relatively low speed-ups.

6. We have proposed a Search and Duplication Based Algorithm (SDBS) with a complexity of
O(V^2), where V is the number of tasks. The input to the algorithm is the set of tasks represented
in the form of a Directed Acyclic Graph (DAG), and it produces the optimal schedule given
a sufficient number of processors and if certain conditions are met.

7. We have also developed a Scalable and Task Duplication based Scheduling (STDS) algorithm,
which is an improvement upon SDBS in terms of scalability. STDS still has a worst-case
complexity of O(V^2), where V is the number of nodes in the DAG. This algorithm initially
generates clusters similar to linear clusters and uses them to generate a new schedule. It
uses the concept of duplicating critical tasks to arrive at the optimal schedule. If the number
of available processors is less than the number of linear clusters, the algorithm scales down
gracefully and still produces a near-optimal schedule. The algorithm has been applied to some
practical DAGs, like Cholesky decomposition, and its performance shows improvement over
existing scheduling techniques. The numbers obtained show that with a decreasing number of
available processors, the schedule length goes up in discrete steps. This can be explained by
the fact that, as the number of processors goes down, the critical path remains unaffected for a while,
but after a while, with more merging of lists, the critical path length goes up.
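
The effect of duplicating a critical task, on which items 6 and 7 rely, can be seen on a toy example. The sketch below is not the SDBS or STDS algorithm; it only evaluates hand-written schedules for a hypothetical fork-shaped DAG, again assuming that communication costs are paid only between different processors. The task names, costs, and the helper `schedule_length` are illustrative assumptions.

```python
# Hypothetical fork-shaped DAG: task "a" feeds both "b" and "c".
cost = {"a": 2, "b": 3, "c": 3}
comm = {("a", "b"): 4, ("a", "c"): 4}


def schedule_length(placements):
    """placements: list of (task, processor) in execution order. A task may be
    placed on several processors, which models task duplication."""
    preds = {}
    for (u, v) in comm:
        preds.setdefault(v, []).append(u)
    finish = {}     # (task, processor) -> finish time of that copy
    proc_free = {}  # processor -> time at which it becomes idle
    for task, p in placements:
        ready = 0
        for u in preds.get(task, []):
            # Use whichever copy of the predecessor delivers its data first.
            arrival = min(
                f + (0 if q == p else comm[(u, task)])
                for (t, q), f in finish.items()
                if t == u
            )
            ready = max(ready, arrival)
        start = max(ready, proc_free.get(p, 0))
        finish[(task, p)] = start + cost[task]
        proc_free[p] = finish[(task, p)]
    return max(finish.values())


serial = schedule_length([("a", 0), ("b", 0), ("c", 0)])
plain = schedule_length([("a", 0), ("b", 0), ("c", 1)])
duplicated = schedule_length([("a", 0), ("a", 1), ("b", 0), ("c", 1)])
print(serial, plain, duplicated)
```

Under these assumed costs the single-processor schedule has length 8, and the two-processor schedule without duplication has length 9 because `c` waits for data from `a`. Running a second copy of `a` on the other processor removes that wait and gives length 5, which equals the longest computation path through the DAG.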